This project is an effort towards partial fulfilment of the requirements for Udacity’s Data Analyst Nanodegree.
The purpose is to perform an Exploratory Data Analysis (mono-, bi- and multivariate) on a dataset containing physicochemical measurements and tasting results of a sample of red wines.
Our goal with this dataset is to investigate how the chemical properties of a wine affect its quality. Ideally, we would arrive at a regression model that predicts the quality of a wine from its chemical properties.
As the authors of the dataset mention in their notes, it is possible that there are correlations between some of the measured quantities. Therefore, in the course of our work we will try to detect any such interactions between the variables.
We start by loading the dataset (stored in a CSV file) into a data.frame:
Some information about the number and the structure of the observations:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
We see that there are 1599 observations of wines, each one containing 13 variables. We can safely remove the “X” column as it simply repeats the natural index of the observations.
The rest of the variables are numeric. This is appropriate for most of them, since they are results of laboratory measurements. However, the “quality” variable is the tasting result, which is a categorical variable with an ordinal scale. Therefore, we can convert this variable to an ordered factor, assuming that a higher number indicates a higher quality.
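The two clean-up steps can be sketched as follows. This is a minimal illustration, not the original code chunk: the CSV file name is not given in the text, so instead of reading a file the sketch rebuilds a tiny data frame from the first ten rows shown in the `str` output above (three columns only). The name `w` for the data frame is taken from the `polr` call shown later in the report.

```r
# First ten rows of three columns, copied from the str() output above.
# In the real analysis the full data frame would come from read.csv().
w <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Drop the redundant row index.
w$X <- NULL

# Convert quality to an ordered factor. The levels 3..8 are the ones
# observed in the full dataset; a higher number means higher quality.
w$quality <- factor(w$quality, levels = 3:8, ordered = TRUE)

str(w)
levels(w$quality)
```

Fixing the levels explicitly keeps the factor comparable across subsets, even when a subset (like this ten-row sample) does not contain every quality grade.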
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
In this last “str” output we clearly see that “quality” is now represented by a 6-level ordered factor with the following levels:
## [1] "3" "4" "5" "6" "7" "8"
We test the data for any missing (NA) or numerically invalid (NaN) values. There are none (the check returns TRUE).
Otherwise, the dataset is already in a “tidy” state: one row per observation, with all observation attributes in separate columns.
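One way to run such a check is sketched below; the original chunk is not shown, so this is only an assumed equivalent. `complete.cases()` flags rows that contain neither NA nor NaN, and the toy data frame stands in for the full dataset.

```r
# Toy stand-in for the wine data frame (values from the first rows above).
w <- data.frame(pH = c(3.51, 3.20, 3.26), alcohol = c(9.4, 9.8, 9.8))

# TRUE only if every row is free of NA and NaN values.
no_missing <- all(complete.cases(w))
no_missing

# For contrast: a copy with one NaN fails the check.
w_bad <- w
w_bad$pH[2] <- NaN
all(complete.cases(w_bad))
```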
In this section we perform an initial statistical and graphical analysis of the variables contained in the dataset.
In the plots below, the red vertical lines correspond to the median value of the variable, while the brown dashed vertical lines mark the 95th percentile of the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The fixed acidity has a roughly normal distribution with some right skew (skewness = 0.9809084) and a relatively long tail (excess kurtosis = 1.1196987).
The boxplot identifies many values as outliers.
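The skewness and excess kurtosis figures can be computed from the central moments. The text does not say which package or estimator was used, so the helper functions below (`skewness`, `excess_kurtosis`) are hypothetical and use the plain moment-based definitions; other estimators give slightly different values. The input is just the first ten fixed.acidity values from the `str` output, so the results will not match the full-sample numbers quoted above.

```r
# Moment-based skewness: third central moment over variance^1.5.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3
# (so that a normal distribution scores 0).
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# First ten fixed.acidity values only, for illustration.
x <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
skewness(x)
excess_kurtosis(x)
```

A positive skewness indicates a longer right tail; a positive excess kurtosis indicates heavier tails than a normal distribution.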
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The volatile.acidity variable has a similar right-skewed distribution (skewness = 0.6703331) and a similarly long tail (excess kurtosis = 1.2126893).
The distribution seems a bit multi-modal. We can see this on a higher-resolution histogram.
The boxplot again identifies several values as outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The citric.acid variable has many values equal to 0, as well as one value equal to 1 (also shown by the boxplot as an outlier). This outlier can be either a data entry error or a wine that has an excessive amount of citric acid (more than the limit of the measuring instrument).
As the value table below shows, the resolution of the measurement is only 0.01 g/dm3, which means that all wines containing less than that will appear as 0 in the dataset.
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
We can get a better idea of the distribution by eliminating the values 0 and 1. We will also adjust the binwidth to correspond to the measurement resolution.
This distribution appears artificially uniform towards the lower values (i.e. there is no gradual reduction in the frequency of the lower values). This can probably be explained by the fact that winemakers usually add some citric acid to give the wine a fresh (non-flat) body. Further support for this is the presence of three very prominent peaks in the histogram (at 0.02, 0.24 and 0.49); the last two in particular may signify that the citric acid amount has been artificially boosted to 0.25 or 0.50. If the citric acid amounts were due to a natural process (like fermentation, as is the case with the other acids), then the distribution would be more bell-shaped (the central limit theorem calls for a normal distribution of the combined effect of many random processes).
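A sketch of this filtering and binning step in base R follows (the original plot was probably drawn differently; this only shows the idea). The bin edges are offset by half the resolution so that each measured value (0.01, 0.02, ...) sits in the middle of its own bin, which avoids floating-point ties at the edges. The data are the first ten citric.acid values from the `str` output.

```r
# First ten citric.acid values, for illustration only.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Keep only values strictly between the two suspicious extremes.
inner <- citric[citric > 0 & citric < 1]

# One bin per 0.01 g/dm3 step, centred on the measured values.
h <- hist(inner, breaks = seq(0.005, 0.995, by = 0.01), plot = FALSE)

length(inner)   # number of retained observations
sum(h$counts)   # every retained observation lands in a bin
```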
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The IQR (0.7) of the residual.sugar values is small compared to the total range (14.6). This indicates that we’re dealing with wines of mostly the same class of sweetness, which is not surprising given that the “Vinho Verde” region produces predominantly very fresh wines.
We can classify the wines in terms of sweetness according to the scale mandated by EU Regulation 753/2002. The scale is as follows:
| Sugar content [g/dm3] | ≤ 4 | (4, 12] | (12, 45] | > 45 |
|---|---|---|---|---|
| Sweetness | Dry | Medium Dry | Medium | Sweet |
We create a new ordered factor called sweetness in the original data frame, with levels corresponding to the sweetness degrees above. As the frequency table below shows, most of the wines are “dry”.
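Such a factor can be built with `cut()`, as sketched here. The level names follow the frequency table below; treating the intervals as right-closed (so that exactly 4 g/dm3 still counts as dry) is an assumption, and the input is only the first ten residual.sugar values from the `str` output.

```r
# First ten residual.sugar values, for illustration only.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Right-closed bins: (0,4], (4,12], (12,45], (45,Inf]
sweetness <- cut(sugar,
                 breaks = c(0, 4, 12, 45, Inf),
                 labels = c("dry", "medium.dry", "medium", "sweet"),
                 ordered_result = TRUE)

table(sweetness)
```

Because the result is an ordered factor, comparisons such as `sweetness > "dry"` are meaningful, and empty levels (here “medium” and “sweet”) still appear in the frequency table.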
##
## dry medium.dry medium sweet
## 1474 117 8 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The amount of NaCl (chlorides) also shows a bell-shaped distribution with a long tail (excess kurtosis = 1.2126893). Most values are again concentrated in a small region (IQR = 0.02 compared to a range of 0.599). This can be explained by the fact that all wines come from a geographically constrained region; it is known that the soil type and micro-climatic conditions of the region have a direct influence on the salinity of the wine. Interestingly, some countries limit the amount of chlorides accepted in a wine (e.g. in Brazil it is 0.2 g/dm3, while in Australia it is 0.6 g/dm3). Generally, levels above 0.5 g/dm3 start to give the sensory perception of saltiness (although it depends on the national diet). There is one “outlier” wine in the sample that could trigger a “salty” grimace from the taster.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Winemakers use SO2 as an antioxidant and disinfectant. Before bottling the wine they usually adjust the level of free SO2 to between 10 and 40 mg/dm3 (careful producers relate the amount of SO2 to the pH: red wines with low pH need less SO2 than ones with higher pH). This range is also clearly visible in the histogram above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
In Europe the permitted level of total SO2 is 150 mg/dm3 for dry red wines (200 mg/dm3 for sugar levels above 5 g/dm3). This requirement is also visible in the graphs above. There are two wines that significantly surpass this limit (theoretically they can’t be sold on the EU market, but they could be sold in the US, where the permitted level is 300 mg/dm3).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
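The density variable is approximately normally distributed (Shapiro-Wilk W = 0.9908655, p = 1.9360528 × 10^{-8}). The W statistic is close to 1, although with 1599 observations the very small p-value formally rejects exact normality. The Normal Q-Q plot below supports the approximately normal shape, but also indicates somewhat heavier tails (hence the outliers in the boxplot).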
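The normality check can be sketched as follows. The data here are synthetic (drawn from a normal distribution whose mean is the density mean reported above and whose standard deviation is made up), so the output only illustrates how to read the test, not the actual result for the wine data.

```r
set.seed(42)

# Synthetic stand-in for the density column; the sd is an arbitrary choice.
x <- rnorm(1599, mean = 0.9967, sd = 0.0019)

sw <- shapiro.test(x)
sw$statistic   # W: values near 1 indicate a shape close to normal
sw$p.value     # a small p-value means normality is rejected

# Q-Q coordinates without drawing; a near-straight line means near-normal.
qq <- qqnorm(x, plot.it = FALSE)
cor(qq$x, qq$y)
```

With samples of this size even small departures from normality yield tiny p-values, so it helps to read W and the Q-Q plot together rather than relying on the p-value alone.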
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH variable is also approximately normally distributed (Shapiro-Wilk W = 0.9934863, p = 1.7122373 × 10^{-6}). Again W is close to 1, while the small p-value formally rejects exact normality at this sample size. The Normal Q-Q plot below supports the approximately normal shape, but also indicates a somewhat heavier right tail (hence the outliers in the boxplot).
The mean pH lies at 3.3, which is consistent with the observation that the wines of “Vinho Verde” are fairly acidic and fresh.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The sulphates variable (measuring predominantly the amount of K2SO4) has a bell-shaped, right-skewed distribution with a longer right tail.
Fresh (not very old) wines normally contain around 0.4-0.7 g/dm3 of K2SO4, which is clearly seen in the histogram (median 0.62).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol content shows a significant right skew (skewness = 0.8592144). As is often the case with right-skewed distributions, this is probably due to a lower limit on the values: European legislation stipulates a minimum alcohol content of 8.5%, while the maximum cannot surpass 15%. We see this range clearly in the statistics above.
The quality variable is categorical, so we will plot it in a barplot:
## 3 4 5 6 7 8
## 10 53 681 638 199 18
We see that the majority of wines are in the middle quality range.
The dataset contains 1599 observations of red wines from the Portuguese region “Vinho Verde”. Each observation has 11 inputs (chemical measurements) and 1 output (subjective sensory assessment of quality).
All laboratory measurements (input variables) are numeric and have the following meaning (in brackets are the units of measurement):
The output variable is the factor quality having the following ordinal scale: 0 (lowest) to 10 (highest). The value of this variable is obtained as the median of three subjective sensory tasting assessments performed by different oenologists. The measured wines are in the range 3-8, i.e. no wine was classified as “very bad” or “excellent”.
We’re interested in finding a relationship that can predict the wine quality based on the various chemical measurements. It is not easy to make a priori assumptions as to which chemical variables have a significant effect on the sensory perception of the wine. Still, what comes first to mind is the alcohol content and the pH, both being aspects that are very easily perceived. Of the other chemical substances, the measured chloride content is mostly below the detection threshold. Likewise, sulphur dioxide (aka sulfites) has neither smell nor taste, so it is not very likely that it can directly determine the tasting assessment. On the other hand, the sulphates can have very different tastes depending on their concentration, as shown on the diagram below.
Citric acid gives freshness to the wine, and as such will probably affect the tasting positively. In contrast, acetic acid (quantified by the volatile.acidity variable) gives an unpleasant, characteristic vinegar taste. We have to keep in mind that higher acidity will correlate with low levels of pH, therefore it is important to distinguish the case when low pH is due to the bad “acetic” acid.
A new categorical variable sweetness was created to classify each wine according to its residual sugar content. It has the following levels: “dry” < “medium dry” < “medium” < “sweet”.
There are two rather normal distributions: pH and density.
Most of the distributions, however, are right-skewed. This often happens when we measure quantities that cannot fall below a given limit (e.g. due to legal requirements, as is the case with the alcohol content). Such distributions can usually benefit from a log transformation, as this brings them closer to normality. An example is given below for the total SO2 variable.
Before the log transformation, the total SO2 variable has a Shapiro-Wilk normality statistic of 0.8732246. After the log transformation, Shapiro-Wilk gives 0.9899255, i.e. the transformed variable is closer to a normal distribution. The before-and-after histograms below confirm this:
The most unusual distribution is that of the citric acid. For didactic purposes, an attempt will be made to normalize the distribution of the citric acid. One way to do this is to use the Box-Cox transformation, which is a type of power transform used frequently in many areas involving statistical analysis.
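The before-and-after comparison can be reproduced in outline as below. The data are synthetic log-normal draws (centred near the total SO2 median of 38 reported above; the spread is made up), so the W values will differ from those quoted in the text; only the direction of the change is the point.

```r
set.seed(1)

# Synthetic right-skewed stand-in for total.sulfur.dioxide.
so2 <- rlnorm(1599, meanlog = log(38), sdlog = 0.7)

w_raw <- shapiro.test(so2)$statistic          # before the transformation
w_log <- shapiro.test(log10(so2))$statistic   # after the transformation

c(raw = unname(w_raw), log = unname(w_log))
```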
First, we plot the Q-Q normality plot of the original citric acid variable. It shows a rather non-normal distribution.
Then we perform the Box-Cox transformation with a plot of the log-likelihood profile.
Finally we plot the Q-Q plot for the transformed variable, as well as a superposition of the original and transformed densities.
It seems that the Box-Cox transformation in this case does not produce a very convincing result…
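For reference, the Box-Cox procedure can be sketched with `boxcox()` from the MASS package. This is an illustration on synthetic, strictly positive data, not the original chunk: Box-Cox requires positive values, so the real citric.acid column (which contains zeros) would first need a shift or a filter, and how the original analysis handled that is not stated.

```r
library(MASS)

set.seed(7)
# Synthetic, strictly positive, right-skewed data (parameters are made up).
y <- rlnorm(500, meanlog = -1.5, sdlog = 0.5)

# Profile the log-likelihood over a grid of lambda values (no plot drawn).
bc <- boxcox(y ~ 1, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# Apply the transform; lambda = 0 corresponds to the log transform.
y_bc <- if (abs(lambda_hat) < 1e-8) log(y) else (y^lambda_hat - 1) / lambda_hat
```

For log-normal data the estimated lambda should come out near 0, i.e. close to a plain log transform. A distribution with spikes and a point mass at zero, like citric acid, does not fit this family well, which is consistent with the unconvincing result noted above.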
To achieve a quick overview of all bi-variate relations, we’ll produce a scatterplot matrix:
From the boxplots in this matrix we easily see that our variable of interest (quality) has a pronounced positive correlation with citric acid, sulphates and alcohol. It has a negative correlation with volatile acidity, chlorides, density and pH. Its relation to the rest of the input variables seems less clearly defined.
A better view of the dependence of quality on the input variables can be obtained from the figures below. They plot conditionally the median of each input variable at each level of quality:
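The conditional medians behind such figures can be computed with `tapply()`, as sketched here on the first ten rows from the `str` output (alcohol and quality only), so only three quality levels appear.

```r
w <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8, ordered = TRUE)
)

# Median alcohol at each quality level that actually occurs in the sample.
med <- tapply(w$alcohol, droplevels(w$quality), median)
med
```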
For an easier identification of the remaining significant correlations, a correlogram can help:
We see a very clear trend of increasing sensory quality with higher alcohol content. Surprisingly, a similar trend is seen for the sulphates. Volatile acidity correlates negatively with quality, which was expected given the unpleasant taste of acetic acid. Citric acid, on the other hand, has a positive correlation, which is also not surprising given its “freshness” sensory effect. An interesting observation is that an increase in the chlorides is associated with lower quality, even if the amounts are below the taste thresholds. The other variables (fixed acidity, SO2 contents, and residual sugar) do not have a clear relationship with the quality perception.
The correlogram shows pretty strong positive correlation (blue color) between:
- Fixed acidity and citric acid (cor=0.6717034): not surprising, since the amount of citric acid is already included in the measurement of the total “good” acids content.
- Fixed acidity and density (cor=0.6680473): this is due to the fact that the three main acids in wine (citric, malic and tartaric) have a higher density than water (1.67, 1.61 and 1.79 g/cm3), therefore their presence increases the overall density of the wine.
- The two SO2 measurements (cor=0.6676665): normal, since the free SO2 is also included in the total measurement.
- Residual sugar and density (cor=0.3552834): as sugar is heavier than water (1.587 kg/L vs. 1.000 kg/L), its presence tends to increase the overall density of the wine.
Strong negative correlation (red color) was found between:
- Fixed acidity and pH (cor=-0.6829782): as expected, since higher acidity manifests itself in lower pH.
- The same holds for citric acid and pH (cor=-0.5419041).
- Density and alcohol content (cor=-0.4961798): since alcohol has a lower density than water (0.780 kg/L compared to 1.000 kg/L), its presence tends to reduce the overall density (all other factors being equal).
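A single pairwise correlation of this kind, together with its 95% confidence interval, can be obtained with `cor.test()`. The sketch uses only the first ten fixed.acidity and pH values from the `str` output, so the estimate will differ from the full-sample value quoted above and the interval will be wide.

```r
# First ten rows only, for illustration.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

ct <- cor.test(fixed_acidity, pH)   # Pearson correlation by default
ct$estimate                         # point estimate
ct$conf.int                         # 95% confidence interval
```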
Somewhat unexpected is the positive correlation seen for volatile acidity (acetic acid) and pH (normally all acids cause lower pH). This may be due to the following phenomenon: the Acetobacter aceti bacteria responsible for the vinegar fermentation and the production of acetic acid thrive in wines with lower fixed acidity and SO2 levels. Therefore, wines with higher volatile acidity will tend to have lower fixed acidity (as we see in the negative correlation between fixed/citric and volatile acidity below).
This lower fixed acidity is probably not offset by the higher content of volatile acidity (acetic acid is a very weak acid), and therefore the pH tends to rise. This acidity-pH relationship is illustrated in the diagrams below:
Visually judging from the plots above, the strongest relationship could be the one between the amount of citric acid and quality.
An interesting experiment is to superimpose the density plots of the different variables that can be assumed to be predictors, conditional to the quality factor.
Another possibility is to visually explore a given chemical measurement across both quality and sweetness levels. This can be justified since the sweetness is one of the first factors by which people evaluate wines.
One issue with these plots is that there are very few observations including “medium” wines–it would be best probably to completely ignore the visualizations for this factor level.
The multivariate plots confirmed the already observed relationships of some variables to quality. In this case we see clearly from the density plots that the distribution modes of these variables change monotonically (increasing for citric.acid, alcohol and sulphates; decreasing for volatile.acidity) across the increasing levels of quality.
From the density plots we also see that the variances of some of the variables (like citric.acid and sulphates) at different quality levels are very similar. For others (volatile.acidity and alcohol) they are different and, interestingly enough, they differ in opposite ways: the variance of alcohol increases with the quality level, while the variance of volatile.acidity decreases.
What this means is that the lower-quality wines can have very different amounts of acetic acid, meaning in turn that there is another, more potent factor that comes into play and downgrades the wine. One such factor could be the amount of alcohol: if it is low, then the wine goes to the bad side, even if there is little acetic acid in it.
Conversely, high-quality wines can have very different values of alcohol content. This could be due to another factor that keeps the quality high despite the varying alcohol content. Maybe this is a low amount of acetic acid.
To test this hypothesized correlation between alcohol and acetic acid levels we can plot their scatterplot faceted by the quality level.
Indeed we see that there is significant correlation between alcohol and acetic acid content at the two quality extremes. So, the “very bad” quality wines steadily feature a low alcohol content; similarly, the “excellent” wines mostly have low volatile acidity.
Since the output variable is categorical (and has an ordinal scale), we have to consider an ordered logistic regression approach to modeling the quality behavior. In other words, we cannot use a method like “lm” (ordinary least squares, OLS), which operates on a continuous dependent variable.
We will use the polr command from the MASS package to estimate an ordered logit regression model. This command relies on the “proportional odds assumption”, i.e. it assumes that the relationship between every pair of outcome groups is the same (in a robust study this is something to be tested, however it is far beyond the scope of the course).
In choosing the predictor variables for the regression, it is important that they are not strongly correlated with each other. For example, it is of no use feeding the model both total.sulfur.dioxide and free.sulfur.dioxide. As the analyses above have shown, the quality output is considerably affected by predictors such as citric.acid, volatile.acidity, sulphates, and alcohol. We will create a number of proportional odds regression models, using different combinations of predictors, and will compare their predictive strengths. The goodness-of-fit is measured via an ad hoc indicator, predict.rate, which is simply the percentage of correctly predicted quality values out of the total number of observations (another possible measure could be, for example, the mean absolute deviation, MAD).
The models we will try are the following:
After fitting these models, a summary table will be printed sorted by increasing value of the “prediction rate”.
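The fit-and-compare loop can be sketched as follows. This is an illustration under stated assumptions, not the original code: the data are synthetic (a latent score built from made-up coefficients and cut into four balanced quality grades), the two candidate formulas are examples rather than the report's actual model list, and `fit_one` is a hypothetical helper. Only `polr()`, `predict()` and `AIC()` are real functions, and `predict.rate` is computed as described above.

```r
library(MASS)

set.seed(123)
n <- 600
d <- data.frame(
  alcohol          = runif(n, 8.5, 15),
  volatile.acidity = runif(n, 0.1, 1.2)
)

# Made-up latent score: better with alcohol, worse with volatile acidity.
latent <- 0.9 * (d$alcohol - 11) - 3 * (d$volatile.acidity - 0.5) + rlogis(n)
d$quality <- cut(latent, breaks = quantile(latent, 0:4 / 4),
                 include.lowest = TRUE,
                 labels = c("4", "5", "6", "7"), ordered_result = TRUE)

formulas <- c("quality ~ alcohol",
              "quality ~ volatile.acidity + alcohol")

# Fit one proportional-odds model and report AIC and in-sample hit rate.
fit_one <- function(f) {
  fit  <- polr(as.formula(f), data = d, Hess = TRUE)
  pred <- predict(fit, newdata = d)   # predicted class per observation
  data.frame(formula = f,
             AIC = AIC(fit),
             predict.rate = mean(as.character(pred) == as.character(d$quality)))
}

results <- do.call(rbind, lapply(formulas, fit_one))
results[order(results$predict.rate), ]
```

Note that a hit rate computed on the training data is optimistic; a held-out test set or cross-validation would give a fairer estimate of predictive power.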
According to the table above, the best model (i.e. the one having the best prediction rate) is:
## [1] "quality ~ citric.acid + volatile.acidity + log10(alcohol)"
It also has a relatively low AIC (Akaike Information Criterion), meaning that its trade-off between goodness-of-fit and model complexity is among the better ones of the models tried.
The coefficients and other fitting statistics of this model are:
## Call:
## polr(formula = best_model$formula, data = w, Hess = T)
##
## Coefficients:
## Value Std. Error t value
## citric.acid 0.06291 0.3129 0.2011
## volatile.acidity -4.01343 0.3680 -10.9067
## log10(alcohol) 23.23954 1.3390 17.3565
##
## Intercepts:
## Value Std. Error t value
## 3|4 15.7292 1.4057 11.1895
## 4|5 17.6521 1.3736 12.8513
## 5|6 21.2355 1.3834 15.3500
## 6|7 23.9430 1.4242 16.8117
## 7|8 26.8746 1.4603 18.4029
##
## Residual Deviance: 3182.95
## AIC: 3198.95
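Two numbers in this output can be checked by hand. The AIC equals the residual deviance plus twice the number of estimated parameters (3 coefficients and 5 intercepts), and each t value is the estimate divided by its standard error. The p-value below uses a normal approximation, which is an assumption on our part since `polr` itself does not print p-values.

```r
# Values copied from the polr summary above.
resid_dev <- 3182.95
k         <- 3 + 5                 # coefficients + intercepts
aic       <- resid_dev + 2 * k     # reproduces the reported AIC
aic

# t value for citric.acid, and an approximate two-sided p-value.
t_citric <- 0.06291 / 0.3129
p_citric <- 2 * pnorm(-abs(t_citric))
c(t = t_citric, p = p_citric)
```

Under this approximation the citric.acid coefficient is not distinguishable from zero in this model, whereas volatile.acidity and log10(alcohol) have very large absolute t values.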
We see that the predictive power of these models is far from satisfactory (< 60%). We could probably obtain better results with a higher-order regression model or a machine learning approach; Random Forests or Support Vector Machines (SVM) come to mind.
As a first plot we will show the distribution (as a density) of all the variables. This is the most direct way to get an immediate, all-encompassing overview of the major mono-variate properties of the measurements.
We can glean the following information about the variables and their behavior (summarized here, treated in detail above in the text):
| Variable | Shapiro-Wilk statistic (W) | p-value |
|---|---|---|
| chlorides | 0.4842466 | 1.1790558 × 10^{-55} |
| density | 0.9908655 | 1.9360528 × 10^{-8} |
| pH | 0.9934863 | 1.7122373 × 10^{-6} |
| residual sugars | 0.5660771 | 1.0201617 × 10^{-52} |
In fact, the W statistic is close to 1 only for pH and density, not for the other two variables (with 1599 observations the p-values are small in all four cases, so exact normality is formally rejected throughout). The low W values for chlorides and residual sugar can arise because these variables have very long and thin right tails (i.e. many outliers), which is clearly visible in their boxplots in the mono-variate section above. If we repeat the Shapiro-Wilk test only on the data below the 95% quantile, the bulk of the chloride data turns out to be much closer to normal, and the residual sugar data somewhat closer:
| Variable | Shapiro-Wilk statistic (W) | p-value |
|---|---|---|
| chlorides < 95% | 0.9867341 | 1.417593 × 10^{-10} |
| residual sugars < 95% | 0.8834183 | 2.7126357 × 10^{-32} |
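The effect of trimming the upper tail on the Shapiro-Wilk statistic can be illustrated with synthetic data: a normal bulk plus 5% of large outliers (all parameters are made up and only loosely mimic the chlorides scale).

```r
set.seed(99)

# 950 "well-behaved" values plus 50 large outliers.
x <- c(rnorm(950, mean = 0.08, sd = 0.01), runif(50, min = 0.15, max = 0.6))

w_full <- shapiro.test(x)$statistic

# Keep only the observations below the 95% quantile.
x_trim <- x[x < quantile(x, 0.95)]
w_trim <- shapiro.test(x_trim)$statistic

c(full = unname(w_full), trimmed = unname(w_trim))
```

The jump in W after trimming shows that a thin tail of extreme values can dominate the statistic even when the bulk of the data is close to normal.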
The approximate normality of such physicochemical measurements suggests that a naturally progressing physical or chemical phenomenon is responsible for the value of the given measurement. Such a phenomenon could be the fermentation (which can be viewed as the summed and overlapping action of millions of microorganisms), or the intake of chlorides from the soil (which is the result of the osmotic action of the millions of small root hairs of the vine plant).
| Variable | Pearson skewness |
|---|---|
| alcohol | 0.8592144 |
| citric.acid | 0.3177403 |
| fixed.acidity | 0.9809084 |
| volatile.acidity | 0.6703331 |
| free.sulfur.dioxide | 1.248222 |
| total.sulfur.dioxide | 1.512689 |
| sulphates | 2.4241176 |
Right skew, in contrast to the normal distribution, is more typical of processes having an artificially set lower limit. This could be a legislative requirement (as for the alcohol content) or a technological dependency (such as the minimal levels of SO2 needed for reliable protection of the wine).
As a second plot we will provide a correlation matrix. This is the quickest and most direct way to view the bi-variate relationships (quantified as correlations) between the variables. Red is negative correlation, blue is positive. The more saturated a color is, the stronger the correlation. The correlation coefficients are given with their 95% CI. As identified in the section for Bi-variate analysis, there is a physical/chemical causation explaining most of the strong correlations (e.g. higher fixed acidity translates into a lower pH).
The following table gives the strongest positive correlations:
As already seen, the strongest positive correlations are between measurements that include/repeat each other, as well as between density and fixed.acidity (physically explainable given the densities of the wine acids).
The following table gives the strongest negative correlations:
Strongest negative correlation is seen between pH and the acidities (since pH is a measurement of the acidity), and between alcohol and density (again explained with the relative density of alcohol).
Identifying the correlated variables is important before performing a regression modeling, since we have to avoid feeding the model with predictors that are correlated (danger of multicollinearity).
The final plot prepares us to select the variables that can serve as inputs for a quality-predicting model (like logistic regression or machine learning algorithms).
In this plot we show how the different physicochemical quantities behave at different quality levels. For this we draw their median values conditioned on the quality level.
We see directly that some of the quantities have a monotonic relationship to quality. These physicochemical properties make the best candidates for predictors of the quality output.
The following table summarizes the observed relationships:
| Input variable | Relationship to quality |
|---|---|
| fixed acidity | non-monotonic |
| volatile acidity | decreasing, perfectly monotonic |
| citric acid | increasing, perfectly monotonic |
| free SO2 | non-monotonic |
| total SO2 | non-monotonic |
| residual sugar | non-monotonic |
| sulphates | increasing, monotonic |
| chlorides | decreasing, somewhat monotonic |
| alcohol | increasing, monotonic |
| pH | decreasing, somewhat monotonic |
| density | decreasing, somewhat monotonic |
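A strict version of this classification can be expressed in a few lines. The helper `trend` below is hypothetical and only distinguishes strictly increasing, strictly decreasing and everything else; the softer “somewhat monotonic” label in the table is a visual judgment that this sketch does not try to reproduce. The input vectors are made-up medians, not values from the dataset.

```r
# Classify a sequence of per-quality medians by the sign of its steps.
trend <- function(m) {
  d <- diff(m)
  if (all(d > 0)) {
    "increasing, monotonic"
  } else if (all(d < 0)) {
    "decreasing, monotonic"
  } else {
    "non-monotonic"
  }
}

# Made-up example medians across six quality levels.
trend(c(0.10, 0.15, 0.22, 0.26, 0.40, 0.42))
trend(c(0.85, 0.67, 0.58, 0.49, 0.37, 0.36))
trend(c(7.5, 7.5, 7.8, 7.9, 8.8, 8.3))
```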
We see that the best candidate predictors of quality are volatile acidity, citric acid, sulphates and alcohol. The previous correlation analysis also shows that there is no strong correlation between these quantities, so any combination of them can serve as an input to a regression model.
Finding a way to predict the sensory quality of wines based on their physicochemical properties can fulfill a dream of winemakers and oenologists.
Applying data mining techniques to large, freely available datasets of wine measurements can probably achieve this.
We can go even further and imagine models that can predict the origin and sort of the grapes from the chemical measurements; or even the climatic quality of the vintage year.
With the given dataset of red wine observations from the Vinho Verde region, it was already possible to determine some chemical properties that can be used as predictors for the quality output. The most apparent ones are the alcohol content, the volatile acidity, the amount of citric acid and of sulphates. Especially the latter was an unexpected relationship.
An attempt was made to obtain an ordinal proportional-odds regression model, using the thus identified predictors. The goodness-of-fit of the found model was not deemed satisfactory, suggesting that the use of more advanced methods like Random Forests can lead to better results.
Unfortunately the dataset does not include any measurements of tannin content; yet tannins (the natural polyphenols giving the astringent taste of wine) are very important for the sensory perception of wines.
When a wine is salty, and why it shouldn’t be
Chloride concentration in red wines: influence of terroir and grape type
Box-Cox Normality Transformation
Grainger K., Tattersall H., Wine Production and Quality, Wiley, 2016